Stories in the Eye: Contextual Visual Interactions for Efficient Video to Language Translation
Abstract
Integrating higher-level visual and linguistic interpretations is at the heart of human intelligence. While automatic visual category recognition in images is approaching human performance, high-level understanding in the dynamic spatiotemporal domain of videos, and its translation into natural language, is still far from being solved. Most work on vision-to-text translation uses pre-learned or pre-established computational linguistic models; in this paper we present an approach that uses vision alone to efficiently learn how to translate video content into language. We discover, in simple form, the story played by the main actors, using only visual cues to represent objects and their interactions. Our method learns, in a hierarchical manner, higher-level representations for recognizing the subjects, actions and objects involved, their relevant contextual background, and their interactions with one another over time. Our approach has three stages: first, we consider features of the individual entities at the local level of appearance; second, we consider the relationships between these objects and actions and their video background; and third, we use their spatiotemporal relations as inputs to classifiers at the highest level of interpretation. Thus, our approach finds a coherent linguistic description of a video in the form of a subject, verb and object, based on the roles they play in the overall visual story, learned directly from training data without using a known language model. We test the efficiency of our approach on a large-scale dataset containing YouTube clips taken in the wild and demonstrate state-of-the-art performance, often superior to current approaches that use more complex, pre-learned linguistic knowledge.
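The three stages described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the features or classifiers, so the scoring functions, the labels, the numeric scores, the context weight, and the names `add_context` and `best_triple` are all hypothetical. The sketch only shows the general idea of combining local appearance scores, context scores, and pairwise interaction scores to select a subject-verb-object triple.

```python
from itertools import product


def add_context(local_scores, context_scores, weight=0.5):
    """Stage 2 (assumed form): blend local scores with scene-context scores."""
    return {
        label: score + weight * context_scores.get(label, 0.0)
        for label, score in local_scores.items()
    }


def best_triple(subjects, verbs, objects, pair_scores):
    """Stage 3 (assumed form): pick the triple with the highest joint score."""

    def joint(triple):
        s, v, o = triple
        unary = subjects[s] + verbs[v] + objects[o]
        # Interaction scores stand in for relations learned from training videos.
        pairwise = pair_scores.get((s, v), 0.0) + pair_scores.get((v, o), 0.0)
        return unary + pairwise

    return max(product(subjects, verbs, objects), key=joint)


# Stage 1: hypothetical local appearance scores for each candidate label.
local_subj = {"person": 0.6, "dog": 0.5}
local_verb = {"ride": 0.4, "run": 0.5}
local_obj = {"bicycle": 0.5, "ball": 0.4}

# Hypothetical background-context scores and pairwise interaction scores.
context = {"bicycle": 0.4, "ride": 0.2}
pair_scores = {
    ("person", "ride"): 0.5,
    ("ride", "bicycle"): 0.6,
    ("dog", "run"): 0.3,
}

subj = add_context(local_subj, context)
verb = add_context(local_verb, context)
obj = add_context(local_obj, context)

triple = best_triple(subj, verb, obj, pair_scores)
print(triple)
```

In this toy setup, the verb "run" has the higher local score, but the context and interaction terms favor "ride", which is the kind of correction the contextual stages are meant to provide.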
Similar resources
The Role of the Creative Industries: Translating Identities on Stages and Visuals
Drawing on research on narrative theory (Baker, 2006, 2014) in translation and interpretation studies, on the interdisciplinary relationship between translation studies and the visual and performing arts, and on the principal diversities between media discourse representations and aesthetic constructions on the topic of the migration crisis, this study addresses the issue of transferring cultur...
Generic Analysis of Literary Translation: A Case Study of Contemporary English Short Stories
Translation of a literary text is a difficult task, for understanding literature requires knowledge of various linguistic levels of a literary text in addition to strategies and methods of translation. To this should still be added cognitive-based translation training which helps practitioners preserve the aesthetic aspects of a literary text. Focusing on short story as a genre with both ...
The Effectiveness of Social Stories with Video Modeling on Communication Skills of Children with Autism Spectrum Disorder (ASD)
Background & purpose: Autism spectrum disorder is a developmental neurological disorder characterized by impairment in social and communication interactions and repetitive behaviors. The increasing prevalence of this disorder has raised the need for research on its characteristics and treatment methods. The aim of this study was to investigate the effect of social stories using video modeling...
Immediate Effects of Different Screen Sizes on Visual Fatigue in Video Display Terminal Users
Background: Computer usage has rapidly grown. This is because it helps to resolve problems, i.e., encountered in daily life by individuals. Various monitor screens that have been developed affect the user's eyes. Screen size is one of the relevant impacts. Thus, this study compared the immediate effects of two computer screen sizes on visual fatigue in Video Display Terminal (VDT) users...
A Novel Approach to Background Subtraction Using Visual Saliency Map
Generally human vision system searches for salient regions and movements in video scenes to lessen the search space and effort. Using visual saliency map for modelling gives important information for understanding in many applications. In this paper we present a simple method with low computation load using visual saliency map for background subtraction in video stream. The proposed technique i...
Journal title:
- CoRR
Volume abs/1511.06674
Pages -
Publication date 2015